Optimizing DDA Code on a POWER5 Processor
Author
Abstract
In this paper we take an existing scientific computation code, DDA, and optimize it to run on an IBM POWER5 processor. The DDA code, originally developed by a Ph.D. candidate in physics, suffers from excessive execution time caused by a high number of cache accesses and a low instructions-per-cycle (IPC) rate. Our goal is to improve the code’s performance through a series of step-by-step optimizations. The first and second stages consisted of selecting specific optimization options available in IBM’s compiler, xlC. The third stage was manual optimization of the code, concentrating mainly on loop fusion. The last stage incorporated OpenMP into the code to take advantage of the two cores available on the POWER5 system. Using the IBM High Performance Toolkit, we recorded the change in the number of L1 data cache misses and references, IPC, and execution time after each phase of optimization. With the original, unoptimized source code as the baseline, we obtained a speedup of 12x from “compiler only” optimizations and an overall speedup of 42x after all modifications were made.
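As a rough illustration of the loop fusion and OpenMP stages named in the abstract, the sketch below fuses two passes over an array into one. The kernel and the function names `scale_then_norm_unfused` and `scale_then_norm_fused` are hypothetical and are not taken from the DDA source; they only show the general shape such a transformation might take.

```c
#include <stddef.h>

/* Hypothetical kernel (an assumption, not from the DDA code):
   scale a vector in place, then return the sum of its squares. */

/* Unfused version: two separate passes, so every element of x is
   brought through the L1 data cache twice. */
double scale_then_norm_unfused(double *x, size_t n, double a)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;
}

/* Fused version: one pass does both steps while x[i] is still in a
   register, roughly halving the number of references to x. */
double scale_then_norm_fused(double *x, size_t n, double a)
{
    double sum = 0.0;
    /* Iterations are independent apart from the sum, so an OpenMP
       reduction can split the loop across cores. The pragma is
       ignored when OpenMP is not enabled at compile time. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++) {
        x[i] *= a;
        sum += x[i] * x[i];
    }
    return sum;
}
```

Both functions return the same value for the same input; only the memory access pattern differs. OpenMP is typically enabled with a compiler option such as `-qsmp=omp` for IBM XL compilers or `-fopenmp` for GCC.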
Similar resources
Using Large Page and Processor Binding to Optimize the Performance of OpenMP Scientific Applications on an IBM POWER5+ System
Multicores are widely used for high performance computing and are being configured in a hierarchical manner to compose a multicore system. While this presents significant new opportunities, such as high inter-core bandwidth and low inter-core latency, it also presents new challenges in the form of inter-core resource conflict and contention. A challenge to be addressed is how well current share...
A Tale of Two Processors: Revisiting the RISC-CISC Debate
The contentious debates between RISC and CISC have died down, and a CISC ISA, the x86, continues to be popular. Nowadays, processors with CISC ISAs translate the CISC instructions into RISC-style micro-operations (e.g., uops of Intel and ROPs of AMD). The use of the uops (or ROPs) allows the use of RISC-style execution cores, and use of various micro-architectural techniques that can be easily imp...
Advanced virtualization capabilities of POWER5 systems
IBM POWER5 systems combine enhancements in the IBM PowerPC processor architecture with greatly enhanced firmware to significantly increase the virtualization capabilities of IBM POWER servers. The POWER hypervisor, the basis of the IBM Virtualization Engine technologies on POWER5 systems, delivers leading-edge mainframe virtualization technologies to the UNIX marketplace. In addition to bei...
A Study of the Influence of the POWER5 Dynamic Resource Balancing (DRB) on Optimal Hardware Thread Priorities
Simultaneous Multithreading, often abbreviated SMT, is a technique for improving the overall efficiency of superscalar processors with hardware multithreading. SMT permits a processor to concurrently execute multiple independent instruction streams every clock cycle, potentially improving processor throughput. However, this can introduce contention for shared resources amongst threads running c...
IBM Power5 chip: a dual-core multithreaded processor - Micro, IEEE
IBM introduced Power4-based systems in 2001. The Power4 design integrates two processor cores on a single chip, a shared second-level cache, a directory for an off-chip third-level cache, and the necessary circuitry to connect it to other Power4 chips to form a system. The dual-processor chip provides natural thread-level parallelism at the chip level. Additionally, the Power4’s out-of-order ex...